Abstract

In this article, we consider convergence rates in functional linear regression 
with functional responses, where the linear coefficient lies in a reproducing 
kernel Hilbert space (RKHS). Without assuming that the reproducing kernel 
and the covariate covariance kernel are aligned, or assuming polynomial rate of 
decay of the eigenvalues of the covariance kernel, convergence rates in prediction 
risk are established. The corresponding lower bound in rates is derived by 
reducing to the scalar response case. Simulation studies and two benchmark 
datasets are used to illustrate that the proposed approach can significantly 
outperform the functional PCA approach in prediction. 

keywords: Functional data; Functional response; Minimax convergence rate; 
Regularization. 



1 Introduction 

The literature contains an impressive range of functional analysis tools for various
problems including exploratory functional principal component analysis, canonical
correlation analysis, classification and regression. Two major approaches exist. The
more traditional approach, masterfully documented in the monograph (Ramsay and Silverman,
2005), typically starts by representing functional data by an expansion with respect to
a certain basis, and subsequent inferences are carried out on the coefficients. The most
commonly utilized bases include the B-spline basis for nonperiodic data and the Fourier
basis for periodic data. Another line of work by the French school (Ferraty and Vieu, 2002),
taking a nonparametric point of view, extends the traditional nonparametric techniques,
most notably the kernel estimate, to the functional case. Some recent advances
in the area of functional regression include Cardot et al. (2003); Cai and Hall (2006);
Preda (2007); Lian (2007); Ait-Saidi et al. (2008); Yao et al. (2005); Crambes et al.
(2009); Ferraty et al. (2011); Lian (2011).

In this paper we study the functional linear regression problem of the form

$$Y(t) = \mu(t) + \int_0^1 \beta(t,s)X(s)\,ds + \epsilon(t), \tag{1}$$

where $Y, X, \epsilon \in L_2[0,1]$ and $E[\epsilon|X] = 0$, the same problem that appeared in
Ramsay and Silverman (2005); Yao et al. (2005); Antoch et al. (2008); Aguilera et al. (2008);
Crambes and Mas (2012). In terms of methodology, the plan of attack we will give for (1) is
most closely related to that of Crambes and Mas (2012). In this introduction, we will explain
the methodology used in that paper and then the different assumption we will make on
$\beta(t,s)$.

Without loss of much generality, throughout the paper we assume $E(X) = 0$ and
the intercept $\mu(t) = 0$, since the intercept can be easily estimated. The covariance
operator of $X$ is the linear operator $\Gamma = E(X \otimes X)$ where for $x, y \in L_2[0,1]$,
$x \otimes y : L_2[0,1] \to L_2[0,1]$ is defined by $(x \otimes y)(g) = \langle y, g\rangle x$ for any
$g \in L_2[0,1]$. $\Gamma$ can also be represented by the bivariate function
$\Gamma(s,t) = E[X(s)X(t)]$. Using the same letter $\Gamma$ to denote both the operator and the
bivariate function will not cause confusion in our context. We assume throughout the paper
that $E\|X\|^4 < \infty$, which implies $\Gamma$ is a compact operator. Then by the
Karhunen-Loève Theorem there exists a spectral expansion for $\Gamma$,

$$\Gamma = \sum_{j=1}^{\infty} \lambda_j\, \varphi_j \otimes \varphi_j,$$

where $\lambda_j \ge 0$ are the eigenvalues with $\lambda_j \to 0$ and $\{\varphi_j\}$ are the
orthonormalized eigenfunctions. Correspondingly, we have the representation
$X = \sum_{j\ge 1} \gamma_j \varphi_j$ with $\gamma_j = \int X\varphi_j$. The random coefficients
satisfy $E\gamma_j\gamma_k = \lambda_j I\{j = k\}$ where $I\{\cdot\}$ is the indicator function.






By expanding $\beta$ using the set of eigenfunctions, we write
$\beta(t,s) = \sum_{j\ge 1} b_j(t)\varphi_j(s)$ and (1) can be equivalently written as

$$Y(t) = \sum_{j\ge 1} b_j(t)\gamma_j + \epsilon(t).$$

Multiplying both sides above by $\gamma_j$ and taking expectations, we easily obtain
$b_j(t) = E[Y(t)\gamma_j]/\lambda_j$. Given i.i.d. data $(X_i, Y_i), i = 1,\dots,n$,
$\{\lambda_j, \varphi_j\}$ can be easily estimated by $\hat\lambda_j$ and $\hat\varphi_j$ obtained
from the spectral decomposition of the empirical covariance operator, and $E[Y(t)\gamma_j]$
can be approximated by the corresponding sample average. Thus the estimator proposed in
Crambes and Mas (2012) is

$$\hat\beta(t,s) = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{k} \frac{\hat\gamma_{ij}}{\hat\lambda_j}\, Y_i(t)\,\hat\varphi_j(s),
\qquad \hat\gamma_{ij} = \int X_i\hat\varphi_j .$$

Note that the infinite sum over $j$ has been truncated at some point $k$ for regularization.
One intriguing point is that no regularization on $Y_i(t)$ is necessary, in contrast
with Yao et al. (2005) where $Y$ is observed sparsely with additional noise. This can
also be seen from the fact that $b_j(t)$ is not a priori constrained in any way. The reason is
that only regularization of the covariance operator, which does not depend on $Y$, is
necessary to avoid overfitting.
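
The truncated functional PCA estimator described above can be sketched numerically as follows. This is only an illustrative discretization, not the implementation of Crambes and Mas (2012): it assumes curves observed on a common equally spaced grid of $[0,1]$, approximates integrals by Riemann sums with weight $h = 1/p$, and centers the data empirically. The function name `fpca_estimator` and its arguments are hypothetical.

```python
import numpy as np

def fpca_estimator(X, Y, k):
    """Truncated FPCA estimate of beta(t, s) on a grid (illustrative sketch).

    X : (n, p) predictor curves on an equally spaced grid of [0, 1] (assumption)
    Y : (n, q) response curves
    k : truncation point
    Returns beta_hat with shape (q, p), beta_hat[t, s].
    """
    n, p = X.shape
    h = 1.0 / p                                # Riemann-sum quadrature weight
    Xc = X - X.mean(axis=0)                    # empirical centering (E X = 0 in the text)
    Yc = Y - Y.mean(axis=0)
    G = Xc.T @ Xc / n                          # empirical covariance kernel on the grid
    evals, evecs = np.linalg.eigh(G * h)       # discretized covariance operator
    order = np.argsort(evals)[::-1][:k]        # k largest eigenvalues
    lam = evals[order]                         # hat lambda_j
    phi = evecs[:, order] / np.sqrt(h)         # hat varphi_j, normalized in L2[0, 1]
    scores = Xc @ phi * h                      # hat gamma_ij = int X_i hat varphi_j
    b = Yc.T @ scores / n / lam                # b_j(t) = mean_i Y_i(t) gamma_ij / lambda_j
    return b @ phi.T                           # sum_j b_j(t) hat varphi_j(s)
```

Predictions for a centered curve `x` on the same grid are then `beta_hat @ x * h`. Note that only the covariance of $X$ is regularized through `k`; the responses enter through plain sample averages.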

Minimax convergence rates of
$E\|\int\hat\beta(t,s)X(s)\,ds - \int\beta(t,s)X(s)\,ds\|^2$ were shown in
Crambes and Mas (2012). A key assumption is the appropriate decaying assumption
on $\|b_j\|$ as $j$ increases. Given that the $b_j$'s are the coefficients of $\beta(t,s)$ in terms of
the basis $\varphi_j$, which is a characteristic of the predictor, there is no a priori reason why
this basis should provide a good representation of $\beta$ in the sense that $\|b_j\|$ will decay
fast. Indeed, a more reasonable assumption for $\beta$ is on its smoothness, which is the
assumption made in Yuan and Cai (2010) and Cai and Yuan (2012) for the scalar response
models. While Crambes and Mas (2012) is based on Cardot et al. (2007) for scalar response
models, ours is based on Cai and Yuan (2012).
The rest of the article is organized as follows. In Section 2, we propose an estimator
for $\beta$ with an RKHS approach where the reproducing kernel and the covariance
kernel are not necessarily aligned. We establish the minimax rate of convergence in
prediction risk by deriving both the upper bound and the lower bound. In Section
3, we present some simulation studies to show that the RKHS approach could significantly
outperform the functional PCA approach when the kernels are mis-aligned.
This advantage is further illustrated on two benchmark datasets, which show better
prediction performance using our approach. We conclude in Section 4 with some
discussions. The technical proofs are relegated to the Appendix.

Finally, we list some notations and properties regarding different norms to be used.
For any operator $\mathcal{F}$, we use $\mathcal{F}^T$ to denote its adjoint operator. If $\mathcal{F}$ is
self-adjoint and nonnegative definite, $\mathcal{F}^{1/2}$ is its square root satisfying
$\mathcal{F}^{1/2}\mathcal{F}^{1/2} = \mathcal{F}$. For $f \in L_2$, $\|f\|$ denotes its $L_2$ norm. For any
operator $\mathcal{F}$, $\|\mathcal{F}\|_{op}$ is the operator norm
$\|\mathcal{F}\|_{op} := \sup_{\|f\|\le 1}\|\mathcal{F}f\|$. The trace norm of an operator $\mathcal{F}$ is
$\mathrm{Trace}(\mathcal{F}) = \sum_k \langle(\mathcal{F}^T\mathcal{F})^{1/2}e_k, e_k\rangle$
for any orthonormal basis $\{e_k\}$ of $L_2$. $\mathcal{F}$ is a trace class operator if its trace norm is
finite. The Hilbert-Schmidt norm of an operator is
$\|\mathcal{F}\|_{HS} = (\sum_{j,k}\langle\mathcal{F}e_j, e_k\rangle^2)^{1/2} = (\sum_j\|\mathcal{F}e_j\|^2)^{1/2}$.
An operator is a Hilbert-Schmidt operator if its Hilbert-Schmidt
norm is finite. From the definition it is easy to see that
$\mathrm{Trace}(\mathcal{F}^T\mathcal{F}) = \mathrm{Trace}(\mathcal{F}\mathcal{F}^T) = \|\mathcal{F}\|^2_{HS}$.
Furthermore, if $\mathcal{F}$ is a Hilbert-Schmidt operator and $\mathcal{G}$ is a bounded operator,
then $\mathcal{F}\mathcal{G}$ is also a Hilbert-Schmidt operator with
$\|\mathcal{F}\mathcal{G}\|_{HS} \le \|\mathcal{F}\|_{HS}\|\mathcal{G}\|_{op}$.
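
As a quick sanity check of the norm identities just listed, the following sketch evaluates them in the finite-dimensional case, where operators are matrices, the Hilbert-Schmidt norm is the Frobenius norm and the operator norm is the largest singular value. The helper names `hs_norm` and `op_norm` are introduced here for illustration only.

```python
import numpy as np

def hs_norm(F):
    """Hilbert-Schmidt (Frobenius) norm: sqrt of the sum of squared entries."""
    return float(np.sqrt(np.sum(F ** 2)))

def op_norm(F):
    """Operator norm: largest singular value."""
    return float(np.linalg.norm(F, 2))

rng = np.random.default_rng(0)
F = rng.normal(size=(6, 6))
G = rng.normal(size=(6, 6))

# Trace(F^T F) = Trace(F F^T) = ||F||_HS^2
trace_identity_gap = abs(np.trace(F.T @ F) - hs_norm(F) ** 2)
# ||F G||_HS <= ||F||_HS ||G||_op
product_bound_holds = hs_norm(F @ G) <= hs_norm(F) * op_norm(G) + 1e-12
```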



2 Methodology and Convergence Rates 



Following Wahba, a RKHS $H$ is a Hilbert space of real-valued functions defined
on, say, the interval $[0,1]$, in which the point evaluation operator
$L_t : H \to R$, $L_t(f) = f(t)$ is continuous. By the Riesz representation theorem, this
definition implies the existence of a bivariate function $K(s,t)$ such that

$$K(s,\cdot) \in H \quad\text{for all } s \in [0,1]$$

and (reproducing property)

$$\text{for every } f \in H \text{ and } t \in [0,1],\quad \langle K(t,\cdot), f\rangle_H = f(t).$$



The definition of a RKHS can actually start from a positive definite bivariate function
$K(s,t)$, and the RKHS is constructed as the completion of the linear span of
$\{K(s,\cdot), s \in [0,1]\}$ with inner product defined by
$\langle K(s,\cdot), K(t,\cdot)\rangle_H = K(s,t)$. To make the dependence on $K$ explicit, the
RKHS is denoted by $H_K$ with the RKHS norm $\|\cdot\|_{H_K}$. With
abuse of notation, $K$ also denotes the linear operator
$f \in L_2 \to Kf = \int K(\cdot,s)f(s)\,ds$.
For later use, we note that $H_K$ is identical to the range of $K^{1/2}$.

We assume that for any $t \in [0,1]$, $\beta(t,\cdot) \in H_K$. This is a smoothness assumption
for $\beta(t,s)$ in the $s$-variable. As noted in the introduction, a smoothness assumption on
the $t$-variable is not necessary. We estimate $\beta$ via

$$\hat\beta = \arg\min_{\beta}\ \frac{1}{n}\sum_{i=1}^{n}\Big\|Y_i - \int_0^1\beta(\cdot,s)X_i(s)\,ds\Big\|^2
 + \lambda\int_0^1\|\beta(t,\cdot)\|^2_{H_K}\,dt. \tag{2}$$

We implicitly assume that the expression $\int\|\beta(t,\cdot)\|^2_{H_K}\,dt$ is valid, that is,
$\|\beta(t,\cdot)\|_{H_K}$ as a function of $t$ is square integrable. This assumption on $\beta$ is
also more succinctly denoted by $\beta \in L_2 \times H_K$.

The following representer theorem is useful in computing the solution, whose proof
is omitted since it is standard.

Proposition 1 The solution of (2) can be expressed as

$$\hat\beta(t,s) = \sum_{i=1}^{n} c_i(t)\int_0^1 K(s,u)X_i(u)\,du. \tag{3}$$

Based on the previous proposition, by plugging the representation (3) into (2), it
can be easily shown that $(c_1(t),\dots,c_n(t))^T = (\Sigma + n\lambda I)^{-1}Y(t)$, where
$Y(t) = (Y_1(t),\dots,Y_n(t))^T$ and $\Sigma$ is an $n \times n$
matrix whose entries are given by $\Sigma_{ij} = \int\int X_i(s)K(s,t)X_j(t)\,ds\,dt$.
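
The computation implied by Proposition 1 can be sketched as follows for curves observed on a common grid. This is a minimal illustration under stated assumptions, not the authors' code: integrals are replaced by Riemann sums with weight $h = 1/p$, and the helper `cosine_kernel` truncates a cosine-series kernel at `J` terms purely for the example. Both function names are hypothetical.

```python
import numpy as np

def cosine_kernel(grid, J=50):
    """K(s,t) = sum_{j<=J} 2/(j*pi)^4 cos(j*pi*s) cos(j*pi*t) on a grid.
    The truncation at J terms is an assumption made for this sketch."""
    j = np.arange(1, J + 1)
    B = np.cos(np.pi * np.outer(j, grid))           # (J, p)
    w = 2.0 / (j * np.pi) ** 4
    return (B.T * w) @ B                            # (p, p), symmetric

def rkhs_fit(X, Y, K, lam):
    """Solve (2) through the representer theorem (3) on a grid.

    X : (n, p) predictors, Y : (n, q) responses, K : (p, p) kernel matrix,
    lam : penalty parameter lambda > 0.
    Returns (beta_hat, C): beta_hat is (q, p); row i of C holds c_i(t).
    """
    n, p = X.shape
    h = 1.0 / p                                     # quadrature weight
    KX = X @ K * h                                  # (K X_i)(s) = int K(s,u) X_i(u) du
    Sigma = KX @ X.T * h                            # Sigma_ij = int int X_i K X_j
    C = np.linalg.solve(Sigma + n * lam * np.eye(n), Y)
    beta_hat = C.T @ KX                             # sum_i c_i(t) (K X_i)(s)
    return beta_hat, C
```

Fitted values on the training curves are `X @ beta_hat.T * h`, which equals `Sigma @ C`; the linear system only has size $n \times n$ regardless of the grid resolution.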

Remark 1 Throughout this section, we assume the reproducing kernel $K$ is positive
definite and the RKHS norm for $H_K$ is used in the penalty. More generally, for
practical use, we can assume $H_K = H_1 \oplus H_2$, where $H_1$, typically finite dimensional,
is a RKHS with reproducing kernel $K_1$ and $H_2$ is a RKHS with reproducing kernel $K_2$,
$K = K_1 + K_2$. We can then impose the penalty
$\int\|P_2\beta(t,\cdot)\|^2_{H_K}\,dt = \int\|P_2\beta(t,\cdot)\|^2_{H_2}\,dt$,
where $P_2$ is the projection onto $H_2$. Our theory and computation can be easily adapted
to this more general case, but we use (2) for ease of presentation throughout the
paper. In real data analysis, $H_K = W_2^{per}$ is the second-order Sobolev space of periodic
functions on $[0,1]$ and we use the decomposition $H_K = H_1 \oplus H_2$ where $H_1$ contains the
constant functions.

Since $\beta(t,\cdot) \in H_K$, there exists $f(t,s)$ such that $\beta(t,\cdot) = K^{1/2}f(t,\cdot)$ and
$\|\beta(t,\cdot)\|_{H_K} = \|f(t,\cdot)\|$. Thus (2) can also be written as

$$\hat f = \arg\min_{f\in L_2[0,1]^2}\ \frac{1}{n}\sum_{i=1}^{n}\Big\|Y_i - \int_0^1 f(\cdot,s)(K^{1/2}X_i)(s)\,ds\Big\|^2
 + \lambda\int_0^1\!\!\int_0^1 f^2(t,s)\,ds\,dt. \tag{4}$$



Due to the appearance of $K^{1/2}X_i$ in the expression above, this suggests that the
spectral decomposition of $T := K^{1/2}\Gamma K^{1/2}$ plays an important role. Suppose the
spectral decomposition of $T$ is

$$T = \sum_{j=1}^{\infty} s_j\, e_j \otimes e_j,$$

with $s_1 \ge s_2 \ge \cdots > 0$.

The following technical assumptions are imposed. 

(A1) There exists a positive, convex, decreasing function $\phi : (0,\infty) \to R^+$ such that
$s_j = \phi(j)$ at least for large $j$.

(A2) Recall the Karhunen-Loève expansion $K^{1/2}X = \sum_j \xi_j e_j$. There exists a
constant $c$ such that $E[\xi_j^4] \le c(E[\xi_j^2])^2$ for all $j \ge 1$.

(A3) $\beta(t,\cdot) \in H_K$ for all $t \in [0,1]$, and $\|\beta(t,\cdot)\|_{H_K} \in L_2$ as a function
of $t$. Furthermore, $BK^{-1/2}$ is a Hilbert-Schmidt operator, where the operator $B$ is defined
by $Bf = \int\beta(\cdot,s)f(s)\,ds$, $f \in L_2$.



Assumption (A1) also appeared in Cardot et al. (2007). Cai and Yuan (2012)
considered a much more restrictive polynomial decay assumption $s_j \asymp j^{-2r}$ for some
$r > 0$, which corresponds to $\phi(x) = x^{-2r}$. Taking $\phi(x) = c_1e^{-c_2x}$ for some constants
$c_1, c_2 > 0$, exponential decay of eigenvalues is also a special case of our result, among
many others.



Assumption (A2) is similar to that assumed in Hall and Horowitz (2007);
Cardot et al. (2007). Cai and Yuan (2012) assumed that
$E(\int X(t)f(t)\,dt)^4 \le c\,(E(\int X(t)f(t)\,dt)^2)^2$
for all $f \in L_2$. This assumption implies (A2), which can be seen by choosing
$f = K^{1/2}e_j$.

(A3) is a natural extension of the case with scalar response, where $\beta(t) \in H_K$
automatically implies $K^{-1/2}\beta \in L_2$. Superficially, $BK^{-1/2}$ in (A3) is only defined on the
range of $K^{1/2}$, which coincides with $H_K$ and is a dense subset of $L_2$. Also, since $K^{-1/2}$
is an unbounded operator, it is not clear that $BK^{-1/2}$ can be bounded. Nevertheless,
it can be shown that under the condition that $\beta(t,\cdot) \in H_K$ and
$\|\beta(t,\cdot)\|_{H_K} \in L_2$, $BK^{-1/2}$ is bounded on $L_2$. More specifically, we have the
following proposition whose proof is in the Appendix.

Proposition 2 If $\beta(t,\cdot) \in H_K$ for all $t \in [0,1]$ and $\|\beta(t,\cdot)\|_{H_K} \in L_2$,
where $\|\beta(t,\cdot)\|_{H_K}$ is regarded as a function of $t$, then $BK^{-1/2}$ is a bounded
operator on $L_2$.

The risk we consider is the prediction risk $E^*\|\hat B(X^*) - B(X^*)\|^2$ where $X^*$ is a
copy of $X$ independent of the training data and $E^*$ is the expectation taken over $X^*$.
We first present the upper bound.

Theorem 1 Under assumptions (A1)-(A3), and that $\lambda \to 0$, $\lambda n \to \infty$, we have

$$E^*\|\hat B(X^*) - B(X^*)\|^2 = O_p\Big(\lambda + \frac{1}{n}\sum_j\frac{s_j^2}{(s_j+\lambda)^2}\Big).$$

Remark 2 By examining the proof carefully, one can actually see that the convergence
is uniform in $\beta$ that satisfies (A3) with $\|BK^{-1/2}\|_{HS} \le 1$ (there is nothing
special about the upper bound 1, which can be replaced by any $L > 0$). We can thus
actually show

$$\lim_{a\to\infty}\lim_{n\to\infty}\ \sup_{\beta\in L_2\times H_K,\ \|BK^{-1/2}\|_{HS}\le 1}
 P\big(E^*\|\hat B(X^*) - B(X^*)\|^2 > a\lambda_0\big) = 0,$$

where $\lambda_0$ is the choice of $\lambda$ defined in (5) below. This expression is put here for
easy comparison with the lower bound obtained in Theorem 2 below.

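We now discuss how to choose appropriate $\lambda$ to balance the two terms in the rate
above. Let $J = \lfloor\phi^{-1}(\lambda)\rfloor$ be the integer part of $\phi^{-1}(\lambda)$. By splitting
the sum over $j$ into $j \le J$ and $j > J$, we have

$$\frac{1}{n}\sum_j\frac{s_j^2}{(s_j+\lambda)^2} \le \frac{J}{n} + \frac{\sum_{j>J}s_j^2}{n\lambda^2}.$$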

Let $\lambda_0$ be the solution to the equation

$$\phi^{-1}(\lambda) = n\lambda. \tag{5}$$

Then we have $J_0 := \lfloor\phi^{-1}(\lambda_0)\rfloor \le \phi^{-1}(\lambda_0)$ and

$$\frac{\sum_{j\ge J_0+1}s_j^2}{n\lambda_0^2} \le \frac{s_{J_0+1}\sum_{j\ge J_0+1}s_j}{n\lambda_0^2}
 \le \frac{(J_0+2)\,s_{J_0+1}^2}{n\lambda_0^2} \le \frac{J_0+2}{n},$$

where we used that $\sum_{j\ge J_0+1}s_j \le (J_0+2)s_{J_0+1}$ obtained from Lemma 1 of
Cardot et al. (2007), and that $s_{J_0+1} = \phi(J_0+1) \le \phi(\phi^{-1}(\lambda_0)) = \lambda_0$ by the
definition of $J_0$. Since $J_0/n \le \phi^{-1}(\lambda_0)/n = \lambda_0$, we thus have

$$E^*\|\hat B(X^*) - B(X^*)\|^2 = O_p(\lambda_0)$$

with $\lambda_0$ defined by (5), which characterizes the optimal convergence rate. In the
special case $\phi(x) = x^{-2r}$, $\lambda_0 = n^{-2r/(2r+1)}$, which is the same as the rate obtained in
Cai and Yuan (2012) for scalar response models. On the other hand, if $\phi(x) = e^{-x}$,
we can easily show that $\log\log n/n < \lambda_0 < \log n/n$, an almost parametric rate.
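
The balancing equation (5) rarely has a closed form for a general decay function, but it is easy to solve numerically because $\phi^{-1}(\lambda) - n\lambda$ is decreasing in $\lambda$. The sketch below uses bisection on a logarithmic scale; the function name `solve_lambda0` and the bracketing interval are choices made for this illustration.

```python
import math

def solve_lambda0(phi_inv, n, lo=1e-12, hi=1.0, iters=200):
    """Solve phi_inv(lam) = n * lam for lam in (lo, hi) by bisection.

    phi_inv is the inverse of the decreasing decay function phi, so
    g(lam) = phi_inv(lam) - n * lam is decreasing and has a single root.
    """
    g = lambda lam: phi_inv(lam) - n * lam
    if not (g(lo) > 0.0 > g(hi)):
        raise ValueError("root is not bracketed by (lo, hi)")
    for _ in range(iters):
        mid = math.sqrt(lo * hi)          # geometric midpoint (log-scale bisection)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

n = 1000
r = 1.0
# Polynomial decay phi(x) = x^(-2r): phi_inv(lam) = lam^(-1/(2r))
lam_poly = solve_lambda0(lambda lam: lam ** (-1.0 / (2.0 * r)), n)
# Exponential decay phi(x) = exp(-x): phi_inv(lam) = -log(lam)
lam_exp = solve_lambda0(lambda lam: -math.log(lam), n)
```

For the polynomial case the numerical root can be compared with $n^{-2r/(2r+1)}$, and for the exponential case with the bracket $\log\log n/n < \lambda_0 < \log n/n$ stated above.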

We now establish the lower bound. This is obtained by first reducing the problem
to the scalar response model and then using a slightly different construction from
that used in Cai and Yuan (2012) to deal with more general $\phi$. The details of the
proof are contained in the Appendix.

Theorem 2 Under assumptions (A1) and (A2) on the predictor distribution, we have

$$\lim_{a\to 0}\lim_{n\to\infty}\ \inf_{\hat B}\ \sup_{\beta\in L_2\times H_K,\ \|BK^{-1/2}\|_{HS}\le 1}
 P\big(E^*\|\hat B(X^*) - B(X^*)\|^2 > a\lambda_0\big) = 1,$$

where the infimum is taken over all possible estimators based on the training data
$(X_i, Y_i), i = 1,\dots,n$.



3 Numerical Results



3.1 Simulations 



The simulation setup is similar to that used in Cai and Yuan (2012). We consider
the RKHS with kernel

$$K(s,t) = \sum_{j\ge 1}\frac{2}{(j\pi)^4}\cos(j\pi s)\cos(j\pi t),$$

and thus $H_K$ consists of functions of the form

$$f(t) = \sum_{j\ge 1} f_j\cos(j\pi t)$$

such that $\sum_j j^4f_j^2 < \infty$. In this case, we actually have
$\|f\|^2_{H_K} = \int(f'')^2$. Data are generated from (1) without the intercept term, with

= ^4V2(-iy^Mcos(j7r S ). 
i ^ 

For the covariance kernel, we use

$$\Gamma(s,t) = \sum_{j\ge 1} 2\theta_j\cos(j\pi s)\cos(j\pi t),$$

where $\theta_j = (|j - j_0| + 1)^{-2}$. When $j_0 = 1$, the two kernels are perfectly aligned, in
the sense that they have the same sequence of eigenfunctions when ordered according
to the eigenvalues. As $j_0$ increases, the level of mis-alignment also increases and we
expect the performance of the functional PCA approach to deteriorate with $j_0$. After
finding the integral $Z_i(t) := \int\beta(t,s)X_i(s)\,ds$ (approximated easily by a Riemann
sum), we discretize $Z_i$ over $[0,1]$ on an equally-spaced grid $(t_1,\dots,t_{100})$ with 100
points and then add independent $\epsilon_{ik} \sim N(0,\sigma^2)$ noises to finally obtain
$Y_i(t_k) = Z_i(t_k) + \epsilon_{ik}$. The discretized data for model fitting contains
$(X_i(t_k), Y_i(t_k)), k = 1,\dots,100, i = 1,\dots,n$. We set $n = 50, 100$ and
$\sigma = 0.1, 0.3$, resulting in a total of four scenarios for each $j_0$. For values of $j_0$, we
use $j_0 \in \{1,3,5,\dots,15\}$. For the functional PCA approach, the tuning parameter is
the truncation point, which we consider in the range from 1 to 25. For the RKHS approach,
the tuning parameter is $\lambda$ and we consider $\lambda \in \{e^{-20}, e^{-19},\dots,e^{0}\}$. The
experiment for each scenario was repeated 100 times.
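
To make the role of $j_0$ concrete, the following sketch computes the covariance eigenvalues $\theta_j$ and draws predictor curves with the covariance kernel $\Gamma$ given above. Several details are assumptions made only for this illustration and are not stated in the text: the scores are taken to be Gaussian, the series is truncated at `J = 50` terms, and the grid uses midpoints. The function names are hypothetical.

```python
import numpy as np

def theta(j0, J=50):
    """Eigenvalues theta_j = (|j - j0| + 1)^(-2), j = 1..J."""
    j = np.arange(1, J + 1)
    return (np.abs(j - j0) + 1.0) ** -2

def simulate_predictors(n, j0, grid, J=50, rng=None):
    """Draw X_i = sum_j sqrt(theta_j) Z_ij sqrt(2) cos(j*pi*t), Z_ij ~ N(0,1).

    Gaussian scores and truncation at J terms are assumptions of this sketch;
    the resulting covariance is sum_j 2 theta_j cos(j*pi*s) cos(j*pi*t).
    """
    rng = np.random.default_rng() if rng is None else rng
    j = np.arange(1, J + 1)
    basis = np.sqrt(2.0) * np.cos(np.pi * np.outer(j, grid))   # (J, p)
    scores = rng.normal(size=(n, J)) * np.sqrt(theta(j0, J))   # (n, J)
    return scores @ basis                                      # (n, p)

grid = (np.arange(100) + 0.5) / 100
X = simulate_predictors(50, j0=5, grid=grid, rng=np.random.default_rng(0))
```

With $j_0 = 1$ the sequence $\theta_j$ is decreasing, so the leading principal components coincide with the leading kernel eigenfunctions; with $j_0 = 5$ the largest variance sits on the fifth cosine, which the RKHS kernel weights only by $(5\pi)^{-4}$.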

In this simulation, the tuning parameters are chosen to yield the smallest error
to reflect the best achievable performance for both methods. To assess the performance,
100 test predictors $X^*_1,\dots,X^*_{100}$ are generated from the same model as the
training data, and the root mean squared error (RMSE) is defined to be
$(\sum_{i=1}^{100}\|\int\hat\beta X^*_i - \int\beta X^*_i\|^2/100)^{1/2}$.
Simulation results are summarized in Figure 1, which shows the
RMSE for both methods. Each panel corresponds to a pair of values of $(n,\sigma)$, and the
curves show the RMSE averaged over 100 replications for both methods as $j_0$ increases
(red curve for the functional PCA approach and black curve for the RKHS approach).
The vertical bar shows $\pm 2$ standard errors computed from the 100 replications.

It is clearly seen that the performance of the RKHS approach is similar to (actually
better than) that of the functional PCA approach for $j_0 = 1$. As $j_0$ increases, the
performance of the functional PCA approach becomes much worse, while the errors
for the RKHS approach remain at the same level. The difference in performance
between these two methods generally increases with $j_0$ (with some exceptions in our
particular simulations).

3.2 Real data 

We now turn to the prediction performance of the proposed method on two real 
datasets. These datasets are used frequently in functional data analysis, and both 
are available from the fda package in R. 

Figure 1: RMSE for both the functional PCA method (red curve) and the RKHS
method (black curve) for the simulated data using the optimal tuning parameters.
(Panels are indexed by $(n,\sigma)$, with $j_0$ on the horizontal axis.)

Figure 2: Leave-one-out prediction error for the real data. The x-coordinate for each
point shows the error of the functional PCA method, and the y-coordinate shows
the error of the RKHS method. Left panel: Canadian weather data; Right panel:
Gait data. The tuning parameters are chosen to minimize the leave-one-out cross-validation
error in both methods.

Canadian weather data. The daily weather data consists of daily temperature
and precipitation measurements recorded in 35 Canadian weather stations. Each observation
consists of functional data observed on an equally-spaced grid of 365 points.
We treat the temperature as the independent variable and the goal is to predict the
corresponding precipitation curve given the temperature measurements. As is previously
done, we set the dependent variable to be the log-transformed precipitation
measurements, and a small positive number is added to the values with 0 precipitation
recorded. Given the periodic nature of the data, we set $H_K = W_2^{per}$, the second-order
Sobolev space of periodic functions on $[0,1]$. The reproducing kernel is given by
$K(s,t) = K_1(s,t) + K_2(s,t)$ with $K_1(s,t) = 1$ and
$K_2(s,t) = \sum_{i\ge 1}\frac{2}{(2\pi i)^4}\cos(2\pi i(s-t))$.
The modification as mentioned in Remark 1 is used. We use leave-one-out cross-validation
to determine the best tuning parameters to use for both methods. The left
panel in Figure 2 shows the prediction errors on the 35 stations using the best tuning
parameters (trained on 34 stations). For 20 stations, the functional PCA approach
has larger error than the RKHS approach. The average mean prediction error for the
functional PCA approach is 0.43 while the error is 0.40 for the RKHS approach.
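
For computation, the periodic Sobolev kernel need not be summed term by term. Assuming the standard series form $K_2(s,t) = \sum_{i\ge 1} 2(2\pi i)^{-4}\cos(2\pi i(s-t))$, the Fourier expansion of the Bernoulli polynomial $B_4(u) = u^4 - 2u^3 + u^2 - 1/30$ gives $K_2(s,t) = -B_4(|s-t|)/24$ for $s,t \in [0,1]$. The sketch below compares the two forms; the function names are hypothetical.

```python
import numpy as np

def k2_series(s, t, terms=2000):
    """Truncated series sum_{i<=terms} 2/(2*pi*i)^4 cos(2*pi*i*(s - t))."""
    i = np.arange(1, terms + 1)
    return float(np.sum(2.0 / (2.0 * np.pi * i) ** 4 * np.cos(2.0 * np.pi * i * (s - t))))

def k2_closed(s, t):
    """Closed form -B_4(|s - t|)/24, valid for s, t in [0, 1]."""
    u = abs(s - t)
    b4 = u ** 4 - 2.0 * u ** 3 + u ** 2 - 1.0 / 30.0
    return -b4 / 24.0

def periodic_sobolev_kernel(s, t):
    """K = K_1 + K_2 with K_1 = 1 (constants) and K_2 as above."""
    return 1.0 + k2_closed(s, t)
```

The closed form makes building the $365 \times 365$ kernel matrix for the daily grid inexpensive.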

Gait data. The Motion Analysis Laboratory at Children's Hospital, San Diego, 
collected these data, which consist of the angles formed by the hip and knee of 39 
children over each child's gait cycle. The cycle begins and ends at the point where 
the heel of the limb under observation strikes the ground. Both sets of functions are 
periodic and it is of interest to see how the two joints interact. In this application, 
we use hip angle as the predictor and knee angle as the response. The right panel 
in Figure 2 shows the prediction errors on the 39 children. For 21 children, the
functional PCA approach has larger error than the RKHS approach. The average 
mean prediction error for the functional PCA approach is 4.49 while the error is 4.38 
for the RKHS approach. 



4 Conclusion



In this paper, we established the minimax rate of convergence for prediction in functional
response models in the general setting where the covariance kernel $\Gamma$ and the
reproducing kernel $K$ are not aligned, and also under a general assumption on the decay
rate of the eigenvalues of the operator $T = K^{1/2}\Gamma K^{1/2}$. Our simulations show that
as the degree of alignment of the two kernels decreases, the RKHS estimator can significantly
outperform the estimator based on functional PCA. The two real datasets
further demonstrate that the RKHS estimator can have better prediction accuracy.
The tuning parameter $\lambda$ can be chosen via cross-validation, as illustrated in
our analysis of the real data. Cai and Yuan (2012) proposed an adaptive method
for tuning parameter selection which is an important theoretical development, but in
our experience does not work as well as cross-validation. Theoretical development of
a good tuning parameter selector can be of significant importance, which we do not
investigate here.

Furthermore, one naturally wonders whether a similar RKHS approach can be
extended to sufficient dimension reduction such as functional sliced inverse regression
(SIR), which has traditionally been based on functional PCA and thus assumes that
the projection direction of interest is well-represented by the basis obtained from
functional PCA. It is interesting to see whether the more general framework can lead
to better performance in functional SIR.



Appendix: Proofs 

Proof of Proposition 2. Let $\{\omega_j\}_{j=1}^{\infty}$ be the eigenfunctions of $K$ corresponding
to the eigenvalues $\alpha_1 \ge \alpha_2 \ge \cdots > 0$. Since $\beta(t,\cdot) \in H_K$, we can write
$\beta(t,s) = \sum_j a_j(t)\omega_j(s)$ for some functions $a_j$, with
$\sum_{j=1}^{\infty}a_j^2(t)/\alpha_j < \infty$ (pointwise summable in $t$). For any
$f = \sum_j f_j\omega_j \in H_K$, $BK^{-1/2}f = \sum_j(f_j/\sqrt{\alpha_j})a_j$. Using
this representation, $BK^{-1/2}$ can be naturally extended to $L_2$ by defining
$BK^{-1/2}f = \sum_j(f_j/\sqrt{\alpha_j})a_j \in L_2$ for any $f \in L_2$. Using the Cauchy-Schwarz
inequality, this operator is obviously bounded on $L_2$ since the assumption that
$\|\beta(t,\cdot)\|_{H_K} \in L_2$ implies
$\int_0^1\sum_j a_j^2(t)/\alpha_j\,dt = \int_0^1\|\beta(t,\cdot)\|^2_{H_K}\,dt < \infty$. □
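
Proof of Theorem 1. In the proofs we use $C$ to denote a generic positive constant.
Using $\beta(t,\cdot) = K^{1/2}f(t,\cdot)$, from (4),

$$\hat B = \frac{1}{n}\sum_{i=1}^{n}(Y_i\otimes K^{1/2}X_i)(T_n+\lambda I)^{-1}K^{1/2},$$

where $I$ is the identity operator, $T_n = K^{1/2}\Gamma_nK^{1/2}$ and
$\Gamma_n = \sum_i(X_i\otimes X_i)/n$ is the empirical version of $\Gamma$. Using
$Y_i = B(X_i) + \epsilon_i$ and noting that $T_n = \sum_i(K^{1/2}X_i\otimes K^{1/2}X_i)/n$, we have

$$\begin{aligned}
\hat B(X^*) - B(X^*)
&= BK^{-1/2}\big(T_n(T_n+\lambda I)^{-1} - I\big)K^{1/2}X^*
 + \frac{1}{n}\sum_i(\epsilon_i\otimes K^{1/2}X_i)(T_n+\lambda I)^{-1}K^{1/2}X^* \\
&= -\lambda BK^{-1/2}(T_n+\lambda I)^{-1}K^{1/2}X^*
 + \frac{1}{n}\sum_i(\epsilon_i\otimes K^{1/2}X_i)(T_n+\lambda I)^{-1}K^{1/2}X^* \\
&=: A_1 + A_2.
\end{aligned}$$

We first deal with $A_1$. Note
$A_1 = -\lambda BK^{-1/2}(T+\lambda I)^{-1}K^{1/2}X^* - \lambda BK^{-1/2}(T_n+\lambda I)^{-1}(T-T_n)(T+\lambda I)^{-1}K^{1/2}X^*$.

Using the expansion $K^{1/2}X^* = \sum_j\xi^*_je_j$,

$$\begin{aligned}
\lambda^2E^*\|BK^{-1/2}(T+\lambda I)^{-1}K^{1/2}X^*\|^2
&= \lambda^2E^*\Big\|\sum_j\frac{\xi^*_j}{s_j+\lambda}BK^{-1/2}e_j\Big\|^2
 = \lambda^2\sum_j\frac{s_j}{(s_j+\lambda)^2}\|BK^{-1/2}e_j\|^2 \\
&\le \lambda\sum_{j,k}\langle BK^{-1/2}e_j, e_k\rangle^2
 = \lambda\|BK^{-1/2}\|^2_{HS}.
\end{aligned} \tag{6}$$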
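
Also, writing $A = BK^{-1/2}(T_n+\lambda I)^{-1}(T-T_n)(T+\lambda I)^{-1}$ for simplicity of notation,

$$\begin{aligned}
&\lambda^2E^*\|BK^{-1/2}(T_n+\lambda I)^{-1}(T-T_n)(T+\lambda I)^{-1}K^{1/2}X^*\|^2 \\
&\quad= \lambda^2E^*\langle AK^{1/2}X^*, AK^{1/2}X^*\rangle \\
&\quad= \lambda^2E^*\langle A^TAK^{1/2}X^*, K^{1/2}X^*\rangle \\
&\quad= \lambda^2\,\mathrm{Trace}(A^TAT) \\
&\quad= \lambda^2\|AT^{1/2}\|^2_{HS} \\
&\quad\le \lambda^2\|BK^{-1/2}\|^2_{HS}\,\|(T_n+\lambda I)^{-1}(T-T_n)(T+\lambda I)^{-1}T^{1/2}\|^2_{op} \\
&\quad\le \lambda^2\|BK^{-1/2}\|^2_{HS}\,\|(T_n+\lambda I)^{-1}\|^2_{op}\,\|(T-T_n)(T+\lambda I)^{-1}T^{1/2}\|^2_{HS} \\
&\quad= O_p\big(\|(T-T_n)(T+\lambda I)^{-1}T^{1/2}\|^2_{HS}\big).
\end{aligned} \tag{7}$$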



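We have

$$\begin{aligned}
E\|(T-T_n)(T+\lambda I)^{-1}T^{1/2}\|^2_{HS}
&= E\sum_{j,k}\langle(T-T_n)(T+\lambda I)^{-1}T^{1/2}e_j, e_k\rangle^2 \\
&= E\sum_{j,k}\Big\langle(T-T_n)\frac{s_j^{1/2}}{s_j+\lambda}e_j, e_k\Big\rangle^2.
\end{aligned} \tag{8}$$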



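Direct calculation reveals that

$$\begin{aligned}
E\langle(T-T_n)e_j, e_k\rangle^2
&= E\Big\langle s_je_j - \frac{1}{n}\sum_i\Big(\Big(\sum_l\xi_{il}e_l\Big)\otimes\Big(\sum_m\xi_{im}e_m\Big)\Big)e_j,\ e_k\Big\rangle^2 \\
&= E\Big(s_jI\{j=k\} - \frac{1}{n}\sum_i\xi_{ij}\xi_{ik}\Big)^2 \\
&= \frac{1}{n}E\big(\xi_{1j}\xi_{1k} - s_jI\{j=k\}\big)^2,
\end{aligned}$$

where the last step used the fact that $E[\xi_{ij}\xi_{ik}] = s_jI\{j=k\}$. Using assumption (A2),
we have $E\langle(T-T_n)e_j, e_k\rangle^2 \le Cs_js_k/n$, which combined with (8) implies

$$E\|(T-T_n)(T+\lambda I)^{-1}T^{1/2}\|^2_{HS} \le \frac{C}{n}\sum_{j,k}\frac{s_j^2s_k}{(s_j+\lambda)^2}. \tag{9}$$

Since $\sum_k s_k < \infty$, (6), (7) and (9) together yield
$E^*\|A_1\|^2 = O_p\big(\lambda + \frac{1}{n}\sum_j\frac{s_j^2}{(s_j+\lambda)^2}\big)$.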

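Now, write
$A_2 = \frac{1}{n}\sum_i(\epsilon_i\otimes K^{1/2}X_i)(T+\lambda I)^{-1}K^{1/2}X^*
 + \frac{1}{n}\sum_i(\epsilon_i\otimes K^{1/2}X_i)(T_n+\lambda I)^{-1}(T-T_n)(T+\lambda I)^{-1}K^{1/2}X^*$.
We have

$$E^*\Big\|\frac{1}{n}\sum_i(\epsilon_i\otimes K^{1/2}X_i)(T+\lambda I)^{-1}K^{1/2}X^*\Big\|^2
 = E^*\Big\|\frac{1}{n}\sum_i\epsilon_i\sum_j\frac{\xi_{ij}\xi^*_j}{s_j+\lambda}\Big\|^2,$$

and thus

$$E\Big\|\frac{1}{n}\sum_i(\epsilon_i\otimes K^{1/2}X_i)(T+\lambda I)^{-1}K^{1/2}X^*\Big\|^2
 = \frac{\sigma^2}{n}E\Big(\sum_j\frac{\xi_{1j}\xi^*_j}{s_j+\lambda}\Big)^2
 = \frac{\sigma^2}{n}\sum_j\frac{s_j^2}{(s_j+\lambda)^2},$$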
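
where $\sigma^2 = E\|\epsilon\|^2$. Furthermore, denoting $C = (T+\lambda I)^{-1}(T-T_n)(T_n+\lambda I)^{-1}$,

$$\begin{aligned}
&E\Big[E^*\Big\|\frac{1}{n}\sum_i(\epsilon_i\otimes K^{1/2}X_i)(T_n+\lambda I)^{-1}(T-T_n)(T+\lambda I)^{-1}K^{1/2}X^*\Big\|^2\ \Big|\ X_1,\dots,X_n\Big] \\
&\quad= \frac{\sigma^2}{n^2}E^*\Big[\sum_i\big\langle K^{1/2}X_i,\ (T_n+\lambda I)^{-1}(T-T_n)(T+\lambda I)^{-1}K^{1/2}X^*\big\rangle^2\ \Big|\ X_1,\dots,X_n\Big] \\
&\quad= \frac{\sigma^2}{n^2}E^*\Big[\sum_i\langle CK^{1/2}X_i, K^{1/2}X^*\rangle^2\ \Big|\ X_1,\dots,X_n\Big] \\
&\quad= \frac{\sigma^2}{n}\mathrm{Trace}(C^TTCT_n) \\
&\quad= \frac{\sigma^2}{n}\mathrm{Trace}(T_n^{1/2}C^TT^{1/2}T^{1/2}CT_n^{1/2}) \\
&\quad= \frac{\sigma^2}{n}\|T_n^{1/2}(T_n+\lambda I)^{-1}(T-T_n)(T+\lambda I)^{-1}T^{1/2}\|^2_{HS} \\
&\quad\le \frac{\sigma^2}{n}\|T_n^{1/2}(T_n+\lambda I)^{-1}\|^2_{op}\,\|(T-T_n)(T+\lambda I)^{-1}T^{1/2}\|^2_{HS} \\
&\quad= o_p\Big(\frac{1}{n}\sum_j\frac{s_j^2}{(s_j+\lambda)^2}\Big),
\end{aligned}$$

where we used (9), $\|T_n^{1/2}(T_n+\lambda I)^{-1}\|^2_{op} \le 1/(4\lambda)$ and that $n\lambda \to \infty$.
Thus we have $E^*\|A_2\|^2 = O_p\big(\frac{1}{n}\sum_j\frac{s_j^2}{(s_j+\lambda)^2}\big)$.
The theorem is proved by combining the bounds for $E^*\|A_1\|^2$ and $E^*\|A_2\|^2$. □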



Proof of Theorem 2. Our model is $Y(t) = \int\beta(t,s)X(s)\,ds + \epsilon(t)$. Consider the
special case $\beta(t,s) = e_1(t)\beta(s)$ and $\epsilon(t) = e_1(t)\chi$, where $\beta(s) \in H_K$,
$\|\beta\|_{H_K} \le 1$, and $\chi \sim N(0,\sigma^2)$. Then by taking inner products with $e_j$ on both
sides of $Y(t) = \int\beta(t,s)X(s)\,ds + \epsilon(t)$, the model becomes
$Y^{(1)} = \int\beta(s)X(s)\,ds + \chi$, $Y^{(2)} = Y^{(3)} = \cdots = 0$, where
$Y^{(j)} = \langle Y, e_j\rangle$. Since $\|\int\beta(t,s)X(s)\,ds\| = |\int\beta(s)X(s)\,ds|$, the lower
bound for the scalar response model provides a lower bound for the functional response
model. Thus we can just consider the model with scalar response:

$$Y_i = \int\beta(s)X_i(s)\,ds + \chi_i,$$

with $\|\beta\|_{H_K} \le 1$. We need a modification of the proof of Theorem 1 in
Cai and Yuan (2012) due to the more general assumption on the eigenvalues of $T$. Let
$\eta_j = \sqrt{c\lambda_0/(J_0s_j)}$ for some $0 < c < 1$ to be determined later. We apply Theorem 2.5
of Tsybakov (2009) using the following collection of $2^{J_0}$ functions

$$f_\theta = \sum_{k=1}^{J_0}\theta_k\eta_kK^{1/2}e_k,$$

where $\theta = (\theta_1,\dots,\theta_{J_0}) \in \{0,1\}^{J_0}$.

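First, using that $\langle K^{1/2}e_j, K^{1/2}e_k\rangle_{H_K} = \langle e_j, e_k\rangle = I\{j=k\}$,

$$\|f_\theta\|^2_{H_K} = \Big\|\sum_{k=1}^{J_0}\theta_k\eta_kK^{1/2}e_k\Big\|^2_{H_K}
 = \sum_{k=1}^{J_0}\theta_k^2\eta_k^2 \le \frac{c\lambda_0}{J_0}\sum_{k=1}^{J_0}\frac{1}{s_k} \le c < 1,$$

since $s_j \ge \lambda_0$ for $j \le J_0$ by $s_{J_0} = \phi(J_0)$ and the definition
$J_0 = \lfloor\phi^{-1}(\lambda_0)\rfloor$.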

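By the Varshamov-Gilbert bound (Lemma 2.9 in Tsybakov (2009)), there is a
subset $\Theta = \{\theta^0,\dots,\theta^N\} \subset \{0,1\}^{J_0}$ such that $\theta^0 = (0,\dots,0)$,
$N \ge 2^{J_0/8}$ and $\sum_{k=1}^{J_0}(\theta_k - \theta'_k)^2 \ge J_0/8$ whenever
$\theta \ne \theta' \in \Theta$. We have

$$\|\Gamma^{1/2}(f_\theta - f_{\theta'})\|^2 = \sum_{k=1}^{J_0}(\theta_k - \theta'_k)^2\eta_k^2s_k
 \ge \frac{c\lambda_0}{J_0}\cdot\frac{J_0}{8} = \frac{c\lambda_0}{8},$$

verifying condition (i) in Theorem 2.5 of Tsybakov (2009). Furthermore, the Kullback-Leibler
distance between $P_\theta$ and $P_{\theta'}$ ($P_\theta$ is the joint distribution of training data when
$\beta = f_\theta$) can be found to be

$$K(P_\theta|P_{\theta'}) = \frac{n}{2\sigma^2}\sum_{k=1}^{J_0}(\theta_k - \theta'_k)^2\eta_k^2s_k,$$

and thus

$$\frac{1}{N}\sum_{j=1}^{N}K(P_{\theta^j}|P_{\theta^0}) \le \frac{nc\lambda_0}{2\sigma^2}
 = \frac{c\,\phi^{-1}(\lambda_0)}{2\sigma^2} \le \frac{c}{2\sigma^2}(J_0+1) \le \alpha\log N,$$

for some $0 < \alpha < 1/8$ if $c$ is chosen small enough, verifying condition (ii) in Theorem
2.5 of Tsybakov (2009). The lower bound is proved by applying Theorem 2.5 of
Tsybakov (2009). □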